Compounding preprocessor for cache.

A digital computer system is described capable of processing two or more computer instructions in parallel
and having a cache storage unit for temporarily storing machine-level computer instructions in their journey from
a higher-level storage unit of the computer system to the functional units which process the instructions. The
computer system includes an instruction compounding unit located intermediate to the higher-level storage unit
and the cache storage unit for analyzing the instructions and generating for each instruction compounding
information which indicates whether or not that instruction may be processed in parallel with one or more
neighboring instructions in the instruction stream. These tagged instructions are then stored in the cache unit
with the compounding information. The computer system further includes a plurality of functional instruction
processing units which operate in parallel with one another. The instructions supplied to these functional units
are obtained from the cache storage unit. At instruction issue time, the compounding information for the
instructions is examined and those instructions indicated for parallel processing are sent to different ones of the
functional units in accordance with the codings of their operation code fields.
TECHNICAL FIELD

[...] digital computers and digital data processors, and particularly to digital [...]
BACKGROUND OF THE INVENTION

[...] manner has improved significantly in the past, largely due to [...] "super scalar" computers or processors.
[...] performance by executing more than one instruction [...] of instructions may [...] super scalar machines
typically [...] (op codes) of the instructions and [...] be executed in parallel. [...] op codes determine the [...]
on data dependencies which may exist between adjacent [...] in general it is not possible for two [...] hardware
concerns [...] to execute an instruction [...] second or further time.
SUMMARY OF INVENTION

As discussed in co-pending Application Serial No. 07/519,384 (IBM Docket EN 990 020), one of the
attributes of a Scalable Compound Instruction Set Machine (SCISM) is performance of the parallel execution
decision prior to execution time. In SCISM architecture, the decision to execute in parallel is made at an
earlier point in the overall instruction handling process. For example, the decision can be made ahead of the
instruction buffer in those machines which have instruction buffers or instruction stacks. For another
example, the decision can be made ahead of the instruction cache in those machines which flow the
instructions through a cache unit.

Another attribute of a SCISM machine is to record the results of the parallel execution decision making 
so that such results are available in the event that those same instructions are used a second or further 
time. 

In one embodiment of the present invention, the recording of the parallel execution decision making is 
accomplished by generating information in the form of tags which accompany the individual instructions in 
an instruction stream. These tags tell whether the instructions can be executed in parallel or whether they 
need to be executed one at a time. This instruction tagging process is sometimes referred to herein as 
"compounding". It serves, in effect, to combine at least two individual instructions into a single compound 
instruction for parallel processing purposes. 

In a particularly advantageous embodiment of the present invention, the computer is one which includes
a cache storage mechanism for temporarily storing machine instructions in their journey from a higher-level
storage unit of the computer to the instruction execution units of the computer. The compounding process
is performed intermediate to the higher-level storage unit and the cache storage mechanism so that there is
stored in the cache storage mechanism both instructions and compounding information. As is known, the
use of a well-designed cache storage mechanism, in and of itself, serves to improve the overall performance
of a computer. Further, the storing of the compounding information into the cache storage
mechanism enables the information to be used over and over again so long as the instructions in question
remain in the cache storage mechanism. As is known, instructions frequently remain in a cache long
enough to be used more than once.

For a better understanding of the present invention, together with other and further advantages and 
features thereof, reference is made to the following description taken in connection with the accompanying 
drawings, the scope of the invention being pointed out in the appended claims. 



BRIEF DESCRIPTION OF THE DRAWINGS 



Referring to the drawings:

Fig. 1 illustrates the location of the invention in a stream of scalar instructions.

Figs. 2A and 2B illustrate categorization of instructions in an exemplary instruction set.

Fig. 3 illustrates how an instruction stream is analyzed according to a set of rules establishing which
instructions of which categories can be executed in parallel with instructions of other categories.

Fig. 4 illustrates the operational environment of the invention and the invention's location in the environment.

Fig. 5 illustrates the formats of instructions which are analyzed for parallel execution according to the invention.

Figs. 6A and 6B form a block diagram illustrating a compounding unit according to the invention which analyzes
instructions for parallel execution according to a set of rules and generates information indicating the outcome
of the analysis.

Fig. 7 is a partial block diagram illustrating how the instruction compounding unit of Fig. 6 analyzes two
instructions.

Figs. 8A, 8B, and 8C are timing diagrams which illustrate operation of the invention according to various
conditions.

Figs. 9A and 9B form a logic diagram illustrating in greater detail a rule-based analysis component of the
instruction compounding unit of Fig. 6.

Fig. 10 is a block diagram of an industrial application of the invention.

Fig. 11 is a representation of a block of instructions analyzed by the instruction compounding unit of Fig. 6
together with an information vector indicating the results of the analysis.

Figs. 12A and 12B are schematic diagrams illustrating cache storage of instruction blocks and accompanying
compounding information.

Fig. 13 illustrates a fragment of an instruction stream analyzed according to the invention with an accompanying
information vector containing the results of the analysis.

Fig. 14 is a chart which shows how the instructions of Fig. 13 are executed in response to the accompanying
analysis information.



DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

Instruction Compounding

Referring to Fig. 1 of the drawings, there is shown a representative embodiment of a portion of a digital
computer system or a digital data processing system constructed in accordance with the present invention.
The illustrated computer system is capable of executing two or more instructions in parallel. The system
[...] process described herein leaves the object code of compounded instructions unaltered, thereby
maintaining compatibility with previously-implemented computer systems.

In order to support the parallel execution of a group of up to N instructions, the computer system
includes a plurality of instruction execution units which operate in parallel and in a concurrent manner.

As is generally shown in Fig. 1, an instruction compounding unit 20 takes a stream of binary scalar
instructions 21 and selectively groups some of the adjacent scalar instructions (which would otherwise be
executed singly) for parallel execution. A resulting compounded instruction stream 22 therefore provides
scalar instructions to be executed singly or in compound instructions formed by groups of scalar
instructions to be executed in parallel. When a scalar instruction is presented to an instruction processing
unit 24, it is routed to the appropriate one of a plurality of execution units for serial execution. When
compounded instructions are presented to the instruction processing unit 24, each [...]
is routed to an appropriate execution unit for simultaneous parallel execution with the others. Typical
execution units include, but are not limited to, an arithmetic logic unit (ALU) 26 for executing an instruction
in response to two operands, a floating point arithmetic unit (FP) 30, a storage address [...] unit
32, and a data-dependency collapsing ALU 28. An exemplary data-dependency collapsing unit is disclosed [...]

[...] invention depends can be [...] environment having a plurality of execution units where each execution
unit executes a scalar instruction or alternatively a compounded scalar instruction. Further, compounded
instructions can be executed in parallel [...] computer system configurations. For example, compounding can
be exploited in a multiprocessor environment where a compound instruction is treated as a single unit for
execution by one of a plurality of CPU's (central processing units).

Preferably, a computer architecture which can be adapted for handling compounded instructions is an
IBM System/370 instruction-level architecture in which multiple scalar instructions can be issued for
execution in each machine cycle. In this context, in the System/370 pipelined computer architecture, a
machine cycle [...] of the pipeline steps or stages required to execute a scalar instruction.
The instruction sets for various IBM System/370 architectures such as [...]
extended architecture (370-XA), and the System/370 Enterprise Systems Architecture (370-ESA) are well
[...]. Respecting these architectures, reference is given here to the Principles of Operation of the IBM
System/370 (Publication #GA22-7000-10, 1987), and to the Principles of Operation of the IBM Enterprise
Systems Architecture/370 (Publication #SA22-7200-0, 1988). Also helpful is the [...]
Assembly Language with ASSIST: Structured Concepts in Advanced Topics, by C. J. Kacmar, Prentice Hall,
1988.

In general, an instruction compounding facility will look for classes of instructions [...]
in parallel, and will ensure that no interlocks between members of a compound [...]
be handled by the hardware. When compatible sequences of instructions are found, the instructions are
compounded.

[...] an interlock occurs in parallel execution when [...]
access to the same execution resource and no hardware means is provided for affording the concurrent [...]
located with half words not containing the first byte of an instruction are ignored. When a compounded
pair is fetched for execution, the tag bit for the first byte of the second instruction of a compounded pair is
also ignored. As a result, this encoding procedure requires only one bit of information to identify a
compounded instruction to a CPU during execution of the instruction.
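
The one-bit-per-half-word encoding can be illustrated with a minimal sketch. It assumes the instruction boundaries are already known and that a tag bit value of one on an instruction's first half word means "compounded with the following instruction". The function name `tag_bits` and its parameters are hypothetical and are not taken from the specification.

```python
def tag_bits(lengths, compounded_with_next):
    """Return one C bit per half word of the instruction text.

    lengths              -- instruction lengths in bytes (2, 4 or 6)
    compounded_with_next -- one boolean per instruction
    Only the bit on the first half word of an instruction is meaningful;
    the bits on the remaining half words are filled with 0 and ignored.
    """
    bits = []
    for length, compounded in zip(lengths, compounded_with_next):
        bits.append(1 if compounded else 0)    # first half word carries the C bit
        bits.extend([0] * (length // 2 - 1))   # remaining half words are ignored
    return bits


# A 2-byte instruction compounded with a following 4-byte instruction,
# then a 2-byte instruction executed alone.
example = tag_bits([2, 4, 2], [True, False, False])
print(example)
```

Four half words of text yield four tag bits here, of which only the first marks a compounded pair.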

As will be appreciated, when more than two scalar instructions can be grouped together to form a
compound instruction, additional tag bits may be required. The minimum number of tag bits needed to
indicate the specific number of scalar instructions actually compounded is the logarithm to the base two
(rounded up to the nearest whole number) of the maximum number of scalar instructions that can be
grouped to form a compound instruction. For example, if the maximum is two, then one tag bit is needed for
each compound instruction. If the maximum is three or four, then two tag bits are needed for each
compound instruction, and so on.
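
As a worked illustration of the rule just stated, the following sketch computes the rounded-up base-two logarithm with integer arithmetic. The function name `tag_bits_needed` is hypothetical.

```python
def tag_bits_needed(max_group_size: int) -> int:
    """Minimum tag bits: log2 of the maximum group size, rounded up."""
    if max_group_size < 2:
        raise ValueError("a compound instruction groups at least two scalar instructions")
    # For n >= 2, (n - 1).bit_length() equals ceil(log2(n)) without floating point.
    return (max_group_size - 1).bit_length()


for n in (2, 3, 4, 5, 8):
    print(n, tag_bits_needed(n))
```

For maxima of two, three, and four the sketch gives one, two, and two tag bits, matching the examples in the text.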

It will be apparent to those skilled in the art that the present invention requires an instruction stream to
be compounded only once for a particular computer system configuration, and thereafter any fetch of
compounded instructions will also cause a fetch of the tag bits associated therewith. This avoids the need
for the inefficient last-minute determination in selection of certain scalar instructions for parallel execution
that repeatedly occurs every time the same or different instructions are fetched for execution in the
so-called super scalar machine.

Despite the advantage of compounding an object code instruction stream, it becomes a difficult
procedure to implement under certain computer architectures unless a technique is developed for determining
instruction boundaries in a byte stream. Such a determination is complicated when variable length
instructions are allowed, and is further complicated when data and instructions can be intermixed in the
same byte stream. Of course, at execution time instruction boundaries must be known to allow proper
execution. But since compounding is preferably done a sufficient time prior to instruction execution, a
technique is needed to compound instructions without knowledge of where instructions start and without
knowledge of which bytes are data. In the example of this invention, the worst case is assumed, that is, that
instruction lengths are variable, that data is intermixed with instructions in the byte stream being
compounded, and no reference points are available in the byte stream to identify instructions. As will be
appreciated, for compounding, the absence of a reference point to identify the beginning of an instruction
creates uncertainty in that many more tag bits will be generated by the compounding unit than might
otherwise be necessary. Nevertheless, the unique technique of this invention works equally well with either
fixed or variable length instructions. Once the start of an instruction is known (or presumed), the length can
always be found in one way or another somewhere in the instructions. In the System/370 instructions, the
length is encoded in the first two bits of the op code. In other systems, the length may be encoded in the
operands or implicit if all instructions are the same length.
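
A minimal sketch of the worst-case approach follows: every half word is presumed to begin an instruction, and the presumed length locates the presumed next instruction. The length mapping (00 for two bytes, 01 or 10 for four bytes, 11 for six bytes) is the one given in the description of Fig. 5; the function names and the sample byte string are hypothetical.

```python
def instruction_length(first_byte: int) -> int:
    """Length in bytes from the two high-order bits of the op code byte."""
    code = (first_byte >> 6) & 0b11
    return {0b00: 2, 0b01: 4, 0b10: 4, 0b11: 6}[code]


def presumed_instructions(text: bytes):
    """Yield (offset, length, next_offset) for every half word of the text.

    No reference point is assumed: each half word is treated as if an
    instruction started there, including half words that are really data or
    the middle of a longer instruction.
    """
    for offset in range(0, len(text) - 1, 2):
        length = instruction_length(text[offset])
        yield offset, length, offset + length


# AR 1,2 (2 bytes), L 3,0(0,4) (4 bytes), SR 3,4 (2 bytes)
text = bytes([0x1A, 0x12, 0x58, 0x30, 0x40, 0x00, 0x1B, 0x34])
for entry in presumed_instructions(text):
    print(entry)
```

The entry at offset 4 is spurious, since that half word lies inside the four-byte instruction. Tag bits produced for such half words are the extra bits the text says are generated and later ignored.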

Operational Environment 

Referring to Fig. 4 of the drawings, there is shown a representative embodiment of a portion of a digital
computer system or digital data processing system constructed in accordance with the present invention.
This computer system is capable of processing two or more instructions in parallel. It includes a first
storage mechanism for storing instructions and data to be processed. This storage mechanism is identified
as higher-level storage 36. This storage 36 (also "main memory") is a larger-capacity, lower-speed storage
mechanism and may be, for example, a large-capacity system storage unit or the lower portion of a
comprehensive hierarchical storage system or the like.

The computer system of Fig. 4 also includes an instruction compounding mechanism for receiving
instructions from the higher-level storage 36 and associating with these instructions compounding
information in the form of tags which indicate which of these instructions may be processed in parallel with one
another. This instruction compounding mechanism is represented by instruction compounding unit 37. This
instruction compounding unit 37 analyzes the incoming instructions for determining which ones may be
processed in parallel. Furthermore, instruction compounding unit 37 produces for these analyzed
instructions tag bits which indicate which instructions may be processed in parallel with one another and which
ones may not be processed in parallel with one another.

The Fig. 4 system further includes a second storage mechanism coupled to the instruction compounding
mechanism 37 for receiving and storing the analyzed instructions and their associated tag bits. This
second or further storage mechanism is represented by compound instruction cache 38. The cache 38 is a
smaller-capacity, higher-speed storage mechanism of the kind commonly used for improving the
performance rate of a computer system by reducing the frequency of having to access the lower-speed storage
mechanism 36.
If access is required to obtain operand data from the resource, a data-dependency interlock exists if the
data must be written by one instruction before being either read or written by the other instruction. An
address generation interlock exists if data being produced by execution of one of the instructions is required
by a simultaneously-executing instruction for address calculation.

In order to identify instructions of a known instruction set which are compatible with other instructions
for simultaneous execution, the set from which the instructions are drawn can be broken into categories of
instructions that may be executed in parallel in a computer system configuration which executes all
instructions of the instruction set. Instructions within certain of these categories may be compounded with
instructions in the same category or with instructions in certain other categories. For example, the
System/370 instruction set can be partitioned into the categories illustrated in Fig. 2. The rationale for this
categorization is based on the functional requirements of the System/370 instructions and their hardware
utilization in a typical System/370 computer system configuration. Other instructions of the System/370
instruction set are not considered specifically for compounding in this exemplary embodiment. This does
not preclude them from being compounded by the technique of the present invention.

For example, consider the instructions contained in category 1 compounded with instructions from that
same category in the following instruction sequence:

AR R1, R2
SR R3, R4

This sequence is free of data dependence interlocks and produces the following results which comprise two
independent System/370 instructions:

R1 = R1 + R2
R3 = R3 - R4

Executing such a sequence would require two independent and parallel two-to-one ALU's designed to the
instruction level architecture. Thus, it will be understood that these two instructions can be grouped to form
a compound instruction in a computer system configuration which has two such ALU's. This example of
compounding scalar instructions can be generalized to all instruction sequence pairs that are free of data
dependence interlocks, hardware dependence interlocks, and address generation interlocks.
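
The data dependence check for a pair of register-to-register instructions can be sketched as follows. This is a simplified illustration and not the rule logic of the compounding unit: it assumes each instruction reads both operand registers and writes the first, as AR and SR do, and the names `RR` and `data_interlock` are hypothetical.

```python
from typing import NamedTuple


class RR(NamedTuple):
    mnemonic: str  # for example "AR" or "SR"
    r1: int        # first operand register: read and written
    r2: int        # second operand register: read only


def data_interlock(first: RR, second: RR) -> bool:
    """True if the second instruction reads or writes the register the first one writes."""
    return first.r1 in (second.r1, second.r2)


independent = data_interlock(RR("AR", 1, 2), RR("SR", 3, 4))
dependent = data_interlock(RR("AR", 2, 3), RR("SR", 4, 2))
print(independent, dependent)
```

The first pair is the AR/SR sequence above and has no interlock. In the second pair the SR reads the register written by the AR, so the pair can be compounded only if the hardware is able to collapse that dependency.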

The flow diagram in Fig. 3 shows the generation of a compound instruction set program from an object
code program in accordance with a set of customized compounding rules which reflect the categories of
Fig. 2 together with both the system and hardware architecture of a System/370 complex. Successive
blocks of object code instructions are provided as a byte stream which is input to a compounding facility
that produces compounded instructions. Successive blocks of instructions in the byte stream having
predetermined lengths are analyzed by the compounding facility 37. The length of each block 33, 34, 35 in
the byte stream which contains the group of instructions considered together for compounding is dependent
on the complexity of the compounding facility.

The particular compounding facility illustrated in Fig. 3 is designed to consider two-way compounding
for "m" instructions in each block. The compounding facility 25 employs a two-instruction-wide window to
consider every pair of instructions in each block.

In this exemplary two-way compounding scheme, compounding information is added to the instruction
stream as one bit for every two bytes of text. In general, a tag containing control information can be
produced for each instruction in the compounded byte stream, that is, for each non-compounded scalar
instruction as well as for each compounded scalar instruction included in a pair, triplet, or larger
compounded group. This general approach is employed in the example of this invention. Relatedly, the tags
specifically identify and differentiate those compounded scalar instructions forming a compounded group
from the remaining non-compounded scalar instructions of a block. The non-compounded scalar instructions
remain in the block, and when fetched are executed alone.

The case of compounding at most two instructions provides the smallest grouping of scalar instructions
to form a compound instruction, and uses the following preferred encoding procedure for the compounding
information. Since all System/370 instructions are aligned on a half word (two-byte) boundary with lengths of
either two, four, or six bytes, only one bit of compounding information need be provided for every half word.
Hereinafter, the bits which contain the compounding information are called "tag bits" or "C bits". In this
example, the tag bit value "one" indicates that the instruction that begins in the byte under consideration is
compounded with the following instruction, while a tag bit value of "zero" indicates that the instruction that
begins in the byte under consideration is not compounded with the following instruction. The tag bits
In Fig. 5 there is illustrated a quadword 50 which forms a portion of a cache line, the remainder of
which is not illustrated. The quadword 50 includes four words, denoted as WORD0-WORD3. Each word
includes a pair of half words, each half word including two bytes of data. Each byte includes eight bits. Bit
positions are numbered in ascending order for the quadword from bit 0 through bit 127.

Assume that the first half word in WORD0 includes a conventional two-byte instruction such as would
be found in the instruction set for the System/370. The half word instruction 52 includes 16 bits of which the
first eight bits, 0-7, form the op code. In the op code, bits 0 and 1 provide the length field code. In
System/370 instructions, a code value of 00 indicates that the instruction is one half word long, the codes 01
and 10 denote a double half word (four byte) instruction, and the code 11 denotes that the instruction
includes three half words (six bytes). The two byte instruction format includes a designation of a first
operand in bit positions 8-11 and the second operand in bit positions 12-15. These operand fields identify
registers of a set of general purpose registers where the operands for the instruction are stored.
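
The two-byte format just described can be decoded with simple shifts and masks, as in the sketch below. Bit 0 is taken to be the most significant bit of the half word, consistent with the ascending bit numbering described for Fig. 5; the function name `decode_two_byte` is hypothetical.

```python
def decode_two_byte(halfword: int):
    """Split a 16-bit instruction into (op code, length code, first operand, second operand)."""
    op_code = (halfword >> 8) & 0xFF     # bits 0-7
    length_code = (op_code >> 6) & 0b11  # bits 0-1 of the op code
    first_operand = (halfword >> 4) & 0xF  # bits 8-11
    second_operand = halfword & 0xF        # bits 12-15
    return op_code, length_code, first_operand, second_operand


# 0x1A12 encodes AR 1,2: op code 0x1A, length code 00, registers 1 and 2.
fields = decode_two_byte(0x1A12)
print(fields)
```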

Reference numeral 54 in Fig. 5 indicates the format for a double half word (four byte) instruction. In the
double half word instruction, the first eight bits (byte 0) contain an op code with a length field code of 01 or
10. The first four bits of the second byte of the double half word (byte 1) identify the first operand for the
instruction in the form of a register (R) in the general purpose registers. The second four bits of byte 1 in
the double half word instruction identify an address index register (RX) in the general purpose registers,
while the first four bits of byte 2 identify a base address register (RB). As is known, the RX and RB
registers are used for operand address calculation.

Instruction Compounding Unit 

For the purpose of understanding the description of the instruction compounding unit which follows,
instructions are provided in a cache line comprising a block of eight quad words, designated QW0-QW7.
The instruction compounding unit, shown in greater detail in Figs. 6A and 6B (hereinafter "Fig. 6"), is suitable
for use as the instruction compounding unit 37 of Fig. 4 to compound a cache line. The instruction
compounding unit of Fig. 6 is designed for the general case in which instructions may be two, four, or six
bytes in length, data may be interspersed in the cache line, and no reference point is provided to indicate
where the first instruction begins. The instruction compounding unit of Fig. 6 simultaneously compounds a
maximum of eight instructions, two instructions at a time, for parallel execution. In this case, a one-bit
compounding signal is generated, with a compounding bit being generated for each half word of the line.
Consequently, sixty four compounding bits (C bits) will be generated for each cache line.

To understand the operation of the instruction compounding unit of Fig. 6, consider the compounding
rules which it implements. If d is a dependency function over two instructions, i_j and i_k, where j and k
represent an instruction category number, i_j will be referred to as the first or left instruction, while i_k will be
referred to as the second or right instruction. The dependency function d maps the dependencies between
the two instructions being compounded into a set {A, E, φ} where A is an address generation dependency,
E is an execution unit (data) dependency, and φ represents no dependencies, that is, independent
execution.

Consider a compounding function C over two instructions being compounded. Given a value for d for
these two instructions, together with a hardware requirement for each instruction, C is a binary function
defined simply as C = 1, meaning that the instructions can be compounded, or C = 0, meaning that the
instructions cannot be compounded.

Consider, for example, the following code sequence:

(1) AR 2,3
(2) SR 4,2
(3) AR 2,3
(4) SR 4,5
(5) SRL 6,1(0)
(6) AR 6,5
(7) AR 2,6

Instructions (1) and (2) may be compounded using two execution units (EU2 and EU3) to calculate R2
= R2 + R3 and R4 = R4 - (R2 + R3). In this regard, EU2 is an execution unit which collapses the
interlock between the instructions by performing a 3-to-1 compound operation. Such an execution unit is
taught in co-pending Patent Application Serial No. 07/504,910. Over instructions (1) and (2), C = 1 and d =

Instructions (3) and (4) may be compounded using EU2 and EU3 to calculate R2 = R2 + R3 and R4
The Fig. 4 system further includes a plurality of functional instruction processing units which operate in
parallel with one another. These functional instruction processing units are represented by functional units
39, 40, 41, et cetera. These functional units 39-41 operate in parallel with one another in a concurrent
manner and each, on its own, is capable of processing one or more types of machine-level instructions.
Examples of functional units which may be used are: a general purpose arithmetic and logic unit (ALU), an
address generation type ALU, a data dependency collapsing ALU (per co-pending Application Serial No.
07/504,910 (IBM Docket EN 990 014)), a branch instruction processing unit, a data shifter unit, a
floating-point processing unit, and so forth. A given computer system may include two or more of some of these
types of functional units. For example, a given computer system may include two or more general purpose
ALU's. Also, no given computer system need include each and every one of these different types of
functional units. The particular configuration of functional units will depend on the nature of the particular
computer system being considered.

The computer system of Fig. 4 also includes an instruction fetch and issue mechanism coupled to
compound instruction cache 38 for supplying adjacent instructions stored therein to different ones of the
functional instruction processing units 39-41 when the instruction tag bits indicate that they may be
processed in parallel. This mechanism also provides single instructions to individual functional units when
their tag bits indicate parallel execution is not possible. This mechanism is represented by instruction fetch
and issue unit 42. Fetch and issue unit 42 fetches instructions from cache 38, examines the tag bits and
instruction operation code (op code) fields and, based upon such examinations, sends the instructions to the
appropriate ones of the functional units 39-41.

A stream of instructions is brought in from auxiliary storage devices by known means, and stored in
blocks called "pages" in the main memory 36. Sets of continuous instructions called "lines" are moved
from the main memory 36 to the compound instruction cache 38 where they are available for high-speed
reference for processing by the instruction fetch and issue unit 42. Instructions which are fetched from a
cache are issued, decoded at 42, and dispatched to the functional units 39-41 for execution.

During execution, when reference is made to an instruction which is in the program, the instruction's
address is provided to a cache management unit 44 which uses the address to fetch one or more
instructions, including the addressed instruction, from the instruction cache 38 into a queue in the unit 42. If
the addressed instruction is in the cache, a cache "hit" occurs. Otherwise, a cache "miss" occurs. A cache
miss will cause the cache management unit 44 to send the line address of the requested instruction to a
group of storage management functions illustrated collectively as a memory management unit 45. These
functions use the line address provided by the cache management unit 44 to send a line of instructions
("cache line") to the compound instruction cache 38.

In the context of SCISM architecture, in-cache instruction compounding is provided by the instruction
compounding unit 37 so that compounding of each cache line can take place at the input to the compound
instruction cache 38. Thus, as each cache line is fetched from the main memory 36 into the cache 38, the
line is analyzed for compounding in the unit 37 and passed, with compounding information, for storage in
the compound instruction cache 38.
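
The effect of compounding at the cache input can be sketched with a toy model in which a line is analyzed once, on a miss, and its tag bits are reused on every later hit. The class below is an illustrative assumption, not the hardware of Fig. 4: a dictionary stands in for main memory 36, a caller-supplied function stands in for compounding unit 37, and all names are hypothetical.

```python
class CompoundInstructionCache:
    """Toy model of a cache that stores lines together with their tag bits."""

    def __init__(self, memory, compound):
        self.memory = memory      # line address -> bytes (stand-in for main memory)
        self.compound = compound  # bytes -> list of tag bits (stand-in for the compounding unit)
        self.lines = {}           # line address -> (bytes, tag bits)
        self.compound_calls = 0   # how many times a line was analyzed

    def fetch(self, line_address):
        if line_address not in self.lines:       # cache miss: compound on the way in
            text = self.memory[line_address]
            self.compound_calls += 1
            self.lines[line_address] = (text, self.compound(text))
        return self.lines[line_address]          # cache hit: stored tags are reused


memory = {0: bytes(16)}
cache = CompoundInstructionCache(memory, lambda text: [0] * (len(text) // 2))
first = cache.fetch(0)
second = cache.fetch(0)
print(cache.compound_calls)
```

Two fetches of the same line cause a single analysis, which is the reuse the text attributes to storing compounding information in the cache.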

Prior to caching, a line is compounded in the instruction compounding unit 37 which generates a set of
tag bits. These tag bits may be appended directly to the instructions with which they are associated, or may
be provided in parallel with the instructions. In any case, the bits are provided for storage together with their
line of instructions in the cache 38. As needed, the compounded instructions in the cache 38 are fetched
together with their tag bits by the instruction fetch and issue unit 42. As the instructions are received by the
fetch and issue unit 42, their tag bits are examined to determine if they may be processed in parallel and
their operation code (op code) fields are examined to determine which of the available functional units is
most appropriate for their processing. If the tag bits indicate that two or more of the instructions are suitable
for processing in parallel, then they are sent to the appropriate ones of the functional units in accordance
with the codings of their op code fields. Such instructions are then processed concurrently with one another
by their respective functional units.

When an instruction is encountered that is not suitable for parallel processing, it is sent to the
appropriate functional unit as determined by its op code and it is thereupon processed alone and by itself in
the selected functional unit.
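
The issue-time use of the tag bits can be sketched for the two-way case. The sketch assumes one C bit per instruction (the bit from the instruction's first half word), with a value of one meaning "compounded with the following instruction", and it ignores the C bit of the second member of a pair. Routing by op code to particular functional units is omitted, and the function name `issue_groups` is hypothetical.

```python
def issue_groups(instructions, c_bits):
    """Group instructions for issue: pairs where the C bit is 1, singles otherwise."""
    groups = []
    i = 0
    while i < len(instructions):
        if c_bits[i] == 1 and i + 1 < len(instructions):
            groups.append((instructions[i], instructions[i + 1]))
            i += 2  # the second instruction's own C bit is ignored
        else:
            groups.append((instructions[i],))
            i += 1
    return groups


groups = issue_groups(["AR 1,2", "SR 3,4", "L 5,0(0,6)"], [1, 1, 0])
print(groups)
```

The first two instructions issue together and the third issues alone, even though the second instruction's C bit is also one.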

In the ideal case, where plural instructions are always being processed in parallel, the instruction
execution rate of the computer system would be N times as great as for the case where instructions are
executed one at a time, with N being the number of instructions in the groups which are being processed in
parallel.
13. Categories 1 and 13

C = 1

14. Categories 1 and 14

C = 0 if d = A; C = 1 otherwise

15. Categories 1 and 15

C = 0 if d = A; C = 1 otherwise

16. Categories 1 and 16

C = 0 if d = A; C = 1 otherwise

17. Categories 1 and 17

C = 0 if d = A; C = 1 otherwise



C = 0 if d = A; C = 1 otherwise

... instruction pair in which the first instruction of ...

An instruction compounding unit such as that ... of two-instruction compounding employing the
compounding unit of Fig. 6 is presented in ... exemplary rules for category 1 compounding given above.

Detailed Description of the Instruction Compounding Unit

The instruction compounding unit ... includes a sixteen-byte bus 60 corresponding to the storage
bus in Fig. 4, which transfers a cache line ... from the main memory 36 to the
instruction cache 38. Each quadword on the bus 60 is latched in a staging unit 61. When latching the
current quadword on the bus 60, the staging unit 61 ... quadword and the two most significant words of
the ... Compounding analysis, including ... the options ... compounding unit of Fig. 6 ...

In Fig. 6, the staging unit includes four registers 75, 76, 77 and 78. Each of the registers is capable of
storing one half of a quadword, that is, a doubleword. ... 76 either
from the bus 60, or from the output of the register 78. The registers 76 and 77 are designated as the L2LO
and L2HI registers, respectively, while the registers 75 and 78 are denoted as the S1 and S2 registers,
respectively.

Preferably, quadwords are forwarded to the cache from the registers ... double word in bits 0:63 ...
into the registers 76 and 77, the last double word of ... bit positions 0:63 of the ...
the S1 register. When the second quadword of a line ... is retained in ...
first quadword is loaded into the S2 register 78 from the L2LO register 76. This ...
the L2LO register 76.

... according to the instruction compounding unit of Fig. 6. Bits 64:127 of quadword n ... 75.
Relatedly, these positions are occupied by two half words 80 and 81 forming word 82, and half words
84 and 85 forming word 86. Bit positions 0:63 of quadword n + 1 are in the corresponding bit positions of
L2LO register 76. Bit positions 0:31 are occupied by half words ... half word ...
Recall that the worst case compounding process requires that a C bit be generated ...
in an instruction byte stream. Therefore, the instruction ...
compounding bit for each of the half words of the ...
C bits. It is assumed that each half word is potentially ... Six-byte
instructions are not compounded in this example, although ... by the inventors that
= R4 - R5. No dependency exists between the instructions; therefore C = 1 and d = φ.

For instructions (5) and (6), d = E, but C = 0 because the interlock cannot be collapsed as the
execution unit hardware of instruction (6) is defined. Instructions (7) and (8) demonstrate an address
generation dependency: according to the compounding rules implemented by the instruction compounding
unit of Fig. 6, C = 0 because d = A.

The following symbology is used for considering two potentially compoundable instructions:
op1 r1,r2 ;first or left instruction
op2 r3,r4,(r5) ;second or right instruction
In this symbology, the designation op refers to the op code found in the first byte of each instruction,
while the designations r1, r2 are registers in the register fields of the first instruction and r3, r4, (and
possibly r5) are the registers in the fields of the second (and possibly third) byte of a second instruction.

Now, considering the symbology described above, if r4 is used as an addressing operand, as for
example in the BCTR and BCR instructions of the System/370 instruction set, r1 = r4 is considered an
address generation dependency. The designations op1 and op2 are generic in that they may refer to an
instruction of any format. The r fields are generally applied to two or four-byte instructions of well-known
formats.
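
The dependency indication d used in this symbology can be sketched as a small classifier. This is a hedged illustration, not the logic of Fig. 6: the function name, the split of the second instruction's registers into data and address sets, and the choice to report an address generation dependency ahead of an execution dependency when both apply are assumptions made for the example.

```python
PHI, E, A = "phi", "E", "A"  # no dependency, execution dependency, address generation dependency

def dependency(r1, data_regs, addr_regs):
    """Classify the interlock between two adjacent instructions.

    r1        -- first-operand register of the first (left) instruction
    data_regs -- registers the second instruction uses as data operands
    addr_regs -- registers the second instruction uses to form an address
    """
    if r1 in addr_regs:   # assumed priority: address use is reported first
        return A
    if r1 in data_regs:
        return E
    return PHI

# First instruction targets register 1; second branches to the address in register 1.
d_branch = dependency(1, data_regs={3}, addr_regs={1})
# Second instruction only reads register 1 as data.
d_data = dependency(1, data_regs={1, 5}, addr_regs=set())
# Second instruction does not touch register 1 at all.
d_none = dependency(1, data_regs={4, 5}, addr_regs=set())
```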

Compounding Rules

The rules for compounding category 1 instructions in an exemplary instruction set such as the
System/370 instruction set are given below. These rules are implemented in the compounding unit of Fig. 6
and permit the compounding of fixed-point with fixed-point instructions and fixed-point with floating-point
instructions. The categories are those designated in Fig. 2.

Category 1 Rules:

1. Categories 1 and 1
C = 1
Exceptions
C = 0 for the following:
1. op1 = any, op2 = any, and r1 = r3 = r4
2. op1 = {AR, SR, ALR, SLR}, op2 = {LPR, LNR}, and r1 = r4

2. Categories 1 and 2
C = 1 if d = φ; C = 0 otherwise

3. Categories 1 and 3
1. If op2 = {BCT, BCTR}, then C = 1 if d = {E, φ}; C = 0 otherwise
2. If op2 = {BXH, BXLE}, then C = 1 if d = φ

4. Categories 1 and 4
C = 0

5. Categories 1 and 5
C = 0
Exceptions
1. If op1 = any and op2 = {BASR}, then C = 1 if d = {E, φ}; C = 0 otherwise
2. If op1 = any and op2 = {BAS}, then C = 0 if d = {A}; C = 1 otherwise

6. Categories 1 and 6
C = 1 if d = φ; C = 0 otherwise

7. Categories 1 and 7
C = 0 if d = A; C = 1 otherwise

8. Categories 1 and 8
C = 0 if d = A; C = 1 otherwise

9. Categories 1 and 9
C = 0 if d = A; C = 1 otherwise

10. Categories 1 and 10
C = 0 if d = A; C = 1 otherwise

11. Categories 1 and 11
C = 1

12. Categories 1 and 12
C = 1
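
Rules 1 through 12 above lend themselves to a table-driven sketch. The Python below is one possible reading, offered as an illustration rather than as the logic of the compounding unit: the function name and argument layout are hypothetical, and the value returned for a BXH/BXLE pair with d other than φ (0) is an assumption, since rule 3.2 states only the C = 1 case.

```python
PHI, E, A = "phi", "E", "A"  # dependency indications used by the rules

def category1_c_bit(cat2, d, op1=None, op2=None, r1=None, r3=None, r4=None):
    """C bit for a category 1 instruction followed by a category `cat2` instruction."""
    if cat2 == 1:
        if r1 is not None and r1 == r3 == r4:                     # exception 1
            return 0
        if (op1 in {"AR", "SR", "ALR", "SLR"} and op2 in {"LPR", "LNR"}
                and r1 is not None and r1 == r4):                 # exception 2
            return 0
        return 1
    if cat2 in (2, 6):
        return 1 if d == PHI else 0
    if cat2 == 3:
        if op2 in {"BCT", "BCTR"}:
            return 1 if d in {E, PHI} else 0
        if op2 in {"BXH", "BXLE"}:
            return 1 if d == PHI else 0   # "0 otherwise" is assumed here
        return 0                          # assumed default for other op codes
    if cat2 == 4:
        return 0
    if cat2 == 5:
        if op2 == "BASR":
            return 1 if d in {E, PHI} else 0
        if op2 == "BAS":
            return 0 if d == A else 1
        return 0
    if cat2 in (7, 8, 9, 10):
        return 0 if d == A else 1
    if cat2 in (11, 12):
        return 1
    raise ValueError("rule not covered by this sketch")
```

For example, `category1_c_bit(1, PHI, op1="AR", op2="LPR", r1=3, r3=5, r4=3)` falls under the second exception to rule 1 and returns 0.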
TRUNCATE signal output by the FSM 66 is low, the C bits input into circuit elements 90, 91 and 92 are
forwarded through those elements to the odd-numbered latches of the register 64.

The instruction compounding unit in Fig. 6 is designed to correctly perform compounding for an
arbitrarily rotated cache line, observing the following conditions:
1. No compounding occurs across cache lines. That is, the last instruction in QW7 of a cache line is not
compounded with the first instruction in QW0 of a following cache line;
2. Up to the last three C bits for a line, that is, the C bits for the last three half words of QW7, are
truncated by being forced to 0, in view of condition (1); and
3. If the cache line has been rotated such that a quadword other than QW0 is received first, then
compounding analysis is performed for instructions lying on the boundary between last and first
quadwords.

In order to compound between the last and first quadwords of a rotated cache line, the S2 register 78
receives the first four half words from the first quadword loaded from the bus 60 and retains them until the
last quadword has been received, at which time the contents of the S2 register 78 are gated through the ...

The FSM 66 is of conventional design and responds to the following input signals:
FIRSTQW, which is asserted when the first quadword of the cache line is placed on the bus 60;
LASTQW, which is asserted when the last quadword of the cache line is on the bus 60;
EOL (End of Line), which is asserted when QW7 is on the bus 60; and
NUMFQW, which is the number (0 to 7) of the first quadword transferred on the bus 60, and which is ...

These signals are produced by the cache management unit 44 (Fig. 4) in the course of a protocol which
controls transfer of cache lines from the high level storage 36 to the compound instruction cache 38 in
response to a cache miss.
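
The relationship between these input signals and a rotated line can be sketched as follows. This is an illustrative model only: the wrap-around arrival order (NUMFQW, NUMFQW + 1, ... modulo 8) is an assumption consistent with the case in which QW7 arrives first, and the function name and dictionary layout are hypothetical.

```python
def bus_sequence(numfqw, quadwords_per_line=8):
    """List, per bus cycle, the quadword on the bus and the FSM input signals."""
    rows = []
    for cycle in range(quadwords_per_line):
        qw = (numfqw + cycle) % quadwords_per_line   # assumed wrap-around order
        rows.append({
            "cycle": cycle,
            "qw": qw,
            "FIRSTQW": cycle == 0,                        # first quadword of the transfer
            "LASTQW": cycle == quadwords_per_line - 1,    # last quadword of the transfer
            "EOL": qw == quadwords_per_line - 1,          # QW7 is on the bus
        })
    return rows

unrotated = bus_sequence(0)
rotated = bus_sequence(7)
```

For an unrotated line, LASTQW and EOL coincide in the final transfer; when QW7 is transferred first, EOL coincides with FIRSTQW instead and the final transfer carries QW6.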

The finite state machine 66 which controls the instruction compounding unit in Fig. 6 produces the
following signals:

LD_L2, signifying load the L2LO and L2HI registers;
LD_S1, which signifies loading the S1 register;
LD_S2, which signifies loading the S2 register;
GT_S2_L2LO, signifying gating of the contents of the S2 to the L2LO register;
LD_CVR (0:15), signifying loading of the C-vector register 64. Each bit of this signal loads a
corresponding four-signal register, that is, if LD_CVR (0) = 1, the register 100 is loaded; if LD_CVR (1) =
1, the register 101 is loaded. Preferably, in the design illustrated in Fig. 6, two LD_CVR lines may be ...
TRUNCATE, which is activated in order to zero the C bits for instructions in the 6th, 7th and 8th half
words in QW7.
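
The mapping from the 16-bit LD_CVR signal to the 4-bit registers 100-115 can be illustrated with a short decoder. The sketch assumes only what the text states, namely that LD_CVR (0) selects register 100 and that bit 0 is the leftmost bit of the four-digit hexadecimal form used in the discussion of Fig. 8A; the function name is hypothetical.

```python
def loaded_cvr_registers(ld_cvr_hex):
    """Return the reference numerals of the 4-bit C-vector registers selected by LD_CVR."""
    value = int(ld_cvr_hex, 16)
    # LD_CVR(0) is the most significant of the 16 bits and selects register 100.
    return [100 + bit for bit in range(16) if value & (1 << (15 - bit))]

first = loaded_cvr_registers("8000")    # first hexadecimal digit "1000"
second = loaded_cvr_registers("6000")   # first hexadecimal digit "0110"
final = loaded_cvr_registers("0001")    # last hexadecimal digit "0001"
```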

Timing of the Instruction Compounding Unit

Figs. 8A-8C show the timing of the instruction compounding unit in Fig. 6 for three representative
rotates of incoming cache lines. The unit operates in a cycle of ten periods. ...
quadword line is transferred, one quadword at a time, in eight successive cycles on the bus ...
quadword on the line is designated as QWN, where N = 0, 1, ... 7. As the quadwords are ...
are designated as QWNL or QWNH, where "L" signifies bits 0:63 of quadword QWN, while "H" signifies
bits 64:127.

With reference now to Fig. 8A, the compounding of the non-rotated cache line will be explained. In Fig.
8A, the quadwords of the cache line are sequentially sent on the bus 60 ... The
presence of the first quadword of the transfer is signified by the signal FIRST_QW, which is asserted
during cycle period 0 when QW0 is on the bus 60, and which falls slightly past the beginning of period 1
when QW1 is on the bus. While the signal FIRST_QW is valid, the FSM 66 gates in the ...
The NUMFQW signal initializes a nine-state cycling counter to a state representing the number of the first
quadword on the bus. In Fig. 8A, NUMFQW has a (decimal) value of 0, indicating that quadword QW0 is on
the bus. In response to the signals FIRST_QW and NUMFQW, the FSM 66 ...
which loads the quadwords from the bus 60 in their arrival sequence into the L2LO and L2HI registers.
... in the second cycle period, the FSM 66 raises the LD_S1 signal, which loads the ...
with the HI double word in the L2LO register 76 in the third cycle period. Thereafter, in each remaining
instructions of any size can be compounded). A compounding box (CBOX) 62a of the rules base unit 62
(Fig. 6) generates a C bit for the half word 80 occupying bit positions 0:15 in the S1 register 75. The C bit
for this half word is generated by the application, in CBOX 62a, of the compounding rules given above.
Thus, the CBOX 62a must first determine whether the half word 80 contains the entirety of a two-byte
instruction or the first half of a four-byte instruction. The CBOX 62a must also compare the operand of the
instruction beginning in the half word 80 with the succeeding instruction to determine whether each
instruction is in a category which can be compounded with the other instruction; it must also determine
whether any interlocks exist between the two instructions in the form of data dependency or address
generation hazards. Thus the CBOX must compare instruction op codes and operand and addressing
registers of the two instructions.

The CBOX 62a assumes that an instruction begins in the half word 80. Recalling the instruction formats
illustrated above in Fig. 5, it will be appreciated that the first 12 bits of the half word 80 will provide the
instruction op code, the length code field of the instruction, and r1. If the length field code of the instruction in
the half word 80 decodes to a two-byte instruction, the CBOX 62a assumes that the next instruction begins
with the half word 81. In order to determine whether an instruction beginning in half word 81 can be
compounded with an instruction in half word 80, the CBOX 62a must have access to the 20 bits beginning
in half word 81 and extending to the first 4 bits in the half word 84. These 20 bits are required in case the
instruction beginning in the half word 81 is 4 bytes long, in which case the first byte includes the instruction
op code, the second byte, designations of r3 and r4, and the following half byte, the designation of
(possibly) register r5.

Continuing with the assumption that the instruction in the half word 80 is a two-byte instruction, the
CBOX 62a receives bits 0:11 of the half word 80 at input I1 and bits 16:35 beginning with the half word 81
at input I21, giving it enough information to determine instruction size, op code compatibility, and any
interlocks.

Assuming that the length field code in bits 0:1 of the half word 80 indicates that the instruction is four
bytes long, the CBOX 62a must have access to the 20 bits beginning with the half word 84, since the half
word 81 is included in the instruction beginning in the half word 80. These 20 bits are obtained from register
positions 32:51 of the S1 register, which embrace all of the half word 84 and the first four bits of the half word
85. The 20 bits for the second instruction following a four-byte instruction are applied at input I22 of the CBOX.
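
The three bit fields supplied to the CBOX 62a can be made concrete with a small extraction helper. This is a sketch under stated assumptions: the S1 register is modelled as a 64-bit Python integer with bit 0 as the most significant bit, and the function names are hypothetical. The sample value packs the System/370 instructions AR 1,2 (0x1A12) and SR 3,4 (0x1B34), followed by the start of a four-byte L instruction (op code 0x58).

```python
def bits(value, width, first, last):
    """Extract bits first..last (inclusive) of a width-bit value, with bit 0 as the MSB."""
    length = last - first + 1
    return (value >> (width - 1 - last)) & ((1 << length) - 1)

def cbox_62a_inputs(s1):
    """Return (I1, I21, I22) for the half word in bit positions 0:15 of a 64-bit S1 value."""
    i1 = bits(s1, 64, 0, 11)     # op code (8 bits) followed by r1 (4 bits)
    i21 = bits(s1, 64, 16, 35)   # 20 bits starting at the next half word
    i22 = bits(s1, 64, 32, 51)   # 20 bits starting two half words further on
    return i1, i21, i22

s1 = 0x1A121B3458123456
i1, i21, i22 = cbox_62a_inputs(s1)
```

With this sample, I1 is 0x1A1 (op code 0x1A, r1 = 1), I21 is 0x1B345 and I22 is 0x58123; since bits 0:1 of I1 are "00", the I21 field is the one that describes the second instruction.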

Attention is drawn to the fact that determination of the compoundability of an instruction beginning in a
half word 81 with the following instructions requires access to the 20 bits beginning in the half word 84, and
to the 20 bits beginning in the half word 85. However, as explained above, the 20 bits beginning in the half
word 85 include the first 4 bits of the half word 87 in the register 76. Therefore the input to the CBOX which
determines the compounding bit value for the half word 81 receives at its I22 input the 20 bits comprising
bits 48:63 in the S1 register 75 and bits 0:3 in the half word 87 stored in bits 0:15 of the L2LO register 76.

Returning to the instruction compounding unit of Fig. 6, eight CBOX circuits 80-87 are shown. The
CBOX circuits perform the actual compounding analysis according to the worst case scenario in which an
instruction stream has variable length instructions intermixed with data and no reference to indicate where
the first instruction of the cache line is. Since, in the System/370 example, all instructions are aligned on
half word boundaries, a starting point for instructions is presumed, that reference point corresponding with
bit position 0 of the first quadword received in a cache line.

Each of CBOXs 80-87 generates a C bit for a respective one of the eight half words contained in the S1
and L2LO registers 75 and 76. Each box receives, at its I1 input, the first 12 bits of a respective one of the
half words, and at its I21 and I22 inputs, the first 20 bits beginning with the first and second half words
following that which provides the I1 input. Thus, for example, the CBOX 80 corresponds to the CBOX 62a of
Fig. 7 in that it receives the first 12 bits of the first half word in the S1 register at its I1 input, the 20 bits
beginning with the second half word in the S1 register at its I21 input, and the 20 bits beginning with the
third half word in the S1 register at its I22 input. In response, the CBOX 80 generates a C bit for the first
half word of the S1 register.

The CBOX 81 generates a C bit for the second half word of the S1 register. It is noted that input I22 of
CBOX 81 receives the 20 bits beginning with the last half word of the S1 register (bits 48:63) and continuing
to the first four bits of the first half word in the L2LO register 76. Similarly, C bits are generated for the third
and fourth half words in the S1 register by CBOXs 82 and 83, while the CBOXs 84-87 generate C bits for
the first, second, third, and fourth half words in the L2LO register 76.

In the register 64, there are illustrated in Fig. 6 16 separate 4-input, 4-output D registers 100-115. Each
of the even-numbered registers receives an input from each of CBOXs 84-87, while each of the odd-
numbered registers receives an input from each of the CBOXs 80-83. In Fig. 6, C bits from the CBOXs 81,
82, and 83 are provided through truncation elements 90, 91, and 92 respectively. For so long as a
cycle period the S1 register 75 receives the half word loaded in the L2HI register 77 during the previous
cycle period until the LD_S1 signal falls. In the second cycle period, FSM 66 also pulses the LD_S2
signal, loading into the S2 register the lower double word of the first quadword received on the L2 bus 60.
When the last quadword is being placed on the L2 bus 60, the LASTQW signal input to the FSM 66 is
activated. In response, in the ninth cycle period of the compounding process, the FSM generates the
GT_S2_L2LO signal, gating the contents of the S2 register into the L2LO register 76 in the tenth cycle
period. The last quadword of the line, that is QW7, is signified to the FSM 66 by the EOL signal. This signal
is latched by the FSM 66 for one period, represented by the EOLLTH signal which is internal to the FSM. In
the cycle period following the EOLLTH signal, the FSM 66 activates the TRUNCATE signal and deactivates
the LD_L2 and LD_S1 signals.

Therefore, for an unrotated cache line, quadwords are placed on the bus 60 in each of a sequence of
eight cycle periods. In all, a ten cycle period defines the sequence for latching quadwords of the cache line
and generating C bits for every half word in the line. Initially, in cycle period 0, QW0 is placed on L2 bus
60. In cycle period one, QW0 is latched in the staging unit 61, with its lower double word QW0L in the L2LO
register 76 and its upper double word QW0H in the L2HI register 77. In cycle period 2, the double word
QW0L is latched in the S2 register 78, where it is held until cycle period 8. At the same time, the next
quadword, QW1, is latched into the registers 76 and 77, while the contents of the L2HI register 77 are
transferred into the S1 register 75. The sequence of entering the quadword into the registers 76 and 77 and
transferring the high double word of the previous word into register 75 is repeated for cycle periods 3-8. In
the last cycle period, the contents of register 78 are transferred back into the register 76, while the high
double word of the previous cycle is transferred into register 75.

C bits are generated by the compounding unit 62 and latched into the CVR register 64 in cycle periods
1-9. In cycle period 1, C bits are generated only for the four half words in the register 76, while in cycle
periods 2-8, C bits are generated and latched for the half words in the registers 75 and 76. In cycle period
9, C bits are generated only for the S1 register 75. Activation of the TRUNCATE signal forces the C bits for
the last three half words in QW7H to 0.

Latching of the C bits generated in the sequence described above can be understood with reference to
the LDCVR and NUMFQW signals in Fig. 8A. The NUMFQW signal is a three-bit signal which is valid while
the FIRST_QW signal is active. The decimal value represented by the digits of the signal corresponds to the
number of the first quadword being transferred. For the unrotated line in Fig. 8A, the value is 0 (decimal).
The FSM 66 uses the value of NUMFQW to initialize a state sequence having nine states. During the first
and ninth states of the sequence, only one LDCVR signal is generated; during the other seven states, two
LDCVR signals are generated. In Fig. 8A, LDCVR signals are given as a hexadecimal representation of the
16-bit LDCVR signal. Each hexadecimal digit represents four consecutive bits of the LDCVR signal. The first
hexadecimal digit represents LDCVR bits 0-3, the second, bits 4-7, the third, bits 8-11, and the fourth, bits
12-15. Each bit of the LDCVR signal loads the correspondingly-numbered 4-bit CVR register. Thus, for
example, LDCVR0, when active, loads the 4-bit CVR register 100, while LDCVR11, when active, loads the
4-bit CVR register 111. In cycle period 1 of Fig. 8A, the hexadecimal representation of the LDCVR signal is
8000. This means that the first hexadecimal digit has the value "1000". Thus, the load signal for the 4-bit
register 100 is active, meaning that the C bits for the half words in the L2LO register 76 are being latched
into the CVR register. In cycle period two, the first digit of the hexadecimal number is "6" while all the other
digits are "0". Decoding the first digit gives the binary number "0110". Relatedly, the load signals for the
4-bit registers 101 and 102 are active. The 4-bit register 101 receives the C bits generated by the CBOXs
80-83 for the half words in the S1 register 75, which is QW0H in cycle period two. Similarly, the 4-bit register
102 is loaded with the C bits generated in CBOXs 84-87 for QW1L. The sequence of Fig. 8A proceeds
through cycle periods 3-8 with the C bits generated by compounding across the quadword in the S1 and
L2LO registers 75 and 76 being captured in the appropriate pair of 4-bit CVR registers. In cycle period 9,
the last hexadecimal digit of the LDCVR signal has a value of "1", corresponding to the binary value of
"0001", which loads the 4-bit CVR register 115 with the final four C bits for the cache line.

Fig. 8B illustrates the quadword loading and C bit generation cycle in the case where a cache line has
been rotated to place the last quadword, QW7, first on the bus 60. In this case, the EOL signal is concurrent
with the FIRST_QW signal. Consequently, the EOLLTH signal is generated internally to the FSM 66,
delaying the EOL signal for one cycle period and resulting in the generation of the TRUNCATE signal
during cycle period two. The TRUNCATE signal prevents the compounding of the last three half words in
QW7H with any instruction. As described below, such compounding is prevented by forcing the C bits for
the last three half words of QW7H to 0. However, the lower double word in QW7, that is QW7L, is retained
in the S2 register 78 until cycle period nine, when it is entered into the L2LO register 76 for compounding
with the instructions in QW6H. The initial value of NUMFQW synchronizes the generation of the LDCVR
signals with the order of the rotated cache line.

Fig. 8C illustrates the ten-period cycle for compounding a rotated cache line in which the first
quadword is neither QW0 nor QW7.

Figs. 9A and 9B (hereinafter "Fig. 9") show a partial design for a CBOX. The design is partial in that
only compounding rules for category 1 instructions are shown. Such compounding is instructive since
category 1 is the worst-case category and places an upper bound on the design complexity of a CBOX. The
skilled artisan will be able to derive corresponding logic which implements compounding rules for
categories 2-12.

The inputs to the CBOX are I1 (0:11), the first twelve bits of the first half word in a pair of instructions.
Following this, this half word will be referred to as "instruction 1". As discussed above in connection with
Fig. 7, these bits contain the op code and r1 fields of the half word being considered for compounding.
Because instruction 1 can be either a two- or four-byte instruction, two choices are possible for the second
instruction (I2): if instruction 1 is a single half word (bits 0:1 = "00"), then instruction 2 comes from the
next half word following instruction 1. This corresponds to input I21 (0:19). As discussed above, instruction 2
may be a four-byte instruction, in which case the first 20 bits of the instruction text are required for
compounding analysis. If bits 0:1 of I1 = "01", "10", or "11", instruction 2 comes from input I22 (0:19).
These are the first twenty bits in the second half word following instruction 1.

Once the instruction length of instruction 1 is determined, instruction 1 and instruction 2 are decoded by
decode blocks (DEC) as required. In this regard, the decode blocks simply decode the instruction op codes,
producing an active output only if the op code corresponds with a predetermined category op code pattern
employed by the decode block. At the same time, the first operand of instruction 1 is compared with the
potential operand and address register fields of instruction 2 to determine whether any data or address
generation interlock exists. Dependency indications are combined with the op code decoding in a manner
which implements the compounding rules given above. The signal generated by the logic of Fig. 9 (termed
"category 1" logic) is a signal CMP_C1, which is asserted if instruction 1 is in category 1 and
compoundable with instruction 2. This signal is combined with signals CMP_C2 through CMP_C17, which
correspond to instruction 1 being in a category from 2 through 17. The final result is the C bit output, which
is asserted if instruction 1 compounds with instruction 2.

Returning now to Fig. 9 and referring to instruction 1 as "I1" and instruction 2 as "I2", the first 12 bits of
I1 are received at input A5. Bits 0:1 of I1 are fed to the input of OR gate 200, whose output is activated if
either of these bits is set. Either bit being set signifies that I1 embraces more than two bytes. An inactive
output of the OR gate 200 signifies that I1 is a two-byte instruction. The output of the OR gate 200 controls
a multiplexer 201. If the output of the OR gate 200 is inactive, input A3 is output by the multiplexer 201. The
input at A3 is the I21 input, which constitutes bits 0:19 from the half word immediately following I1.
Otherwise, if the output of the OR gate 200 is activated, the input at A4 is selected by the multiplexer 201.
As illustrated, the input at A4 is I22, constituting the first 20 bits (0:19) of the second half word after I1. The
op code portion (bits 0:7) of I1 is decoded in three decoders 210a, 210b, and 210c. All of these decoders
decode category 1 instructions. Further, decoder 210b decodes either an AR or an ALR instruction, while
decoder 210c decodes an SR or an SLR instruction.
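
The selection and decoding front end just described can be mirrored in a few lines of Python. This is a behavioural sketch, not the gate-level design of Fig. 9: the function names are hypothetical, I1 is modelled as a 12-bit integer with bit 0 as its most significant bit, and the op code values are the standard System/370 encodings for AR (0x1A), SR (0x1B), ALR (0x1E) and SLR (0x1F).

```python
AR, SR, ALR, SLR = 0x1A, 0x1B, 0x1E, 0x1F  # System/370 RR-format op codes

def or_gate_200(i1):
    """1 when either of bits 0:1 of the 12-bit I1 is set (I1 is longer than two bytes)."""
    return ((i1 >> 11) | (i1 >> 10)) & 1

def mux_201(i1, i21, i22):
    """Select the 20-bit field that holds the start of instruction 2."""
    return i22 if or_gate_200(i1) else i21

def decoder_210b(i1):
    """Active for an AR or ALR op code in bits 0:7 of I1."""
    return (i1 >> 4) in (AR, ALR)

def decoder_210c(i1):
    """Active for an SR or SLR op code in bits 0:7 of I1."""
    return (i1 >> 4) in (SR, SLR)

two_byte = mux_201(0x1A1, 0x1B345, 0x58123)    # AR is two bytes long: take I21
four_byte = mux_201(0x581, 0x34561, 0x18340)   # op code 0x58 begins "01": take I22
```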

The op code of the half word selected by the multiplexer 201 is fed to a bank of decode blocks 212a
and 212b. If the op code satisfies the decoding condition of one of the blocks, the decoding block will
activate. The decoding block conditions are listed in Table I. For example, if I2 has an op code which is
decoded as a branch on count, the decoder denoted as I = BCTR will activate its output.
Comparisons of register fields for I1 and I2 are performed in comparison (CMP) blocks 214-217. These
comparisons are for the purpose of identifying dependencies which may constitute interlocks. Each of these
blocks compares register r1 identified in bits 8:11 of I1 with the contents of the register field locations of I2.
If the compared values are unequal, the output of a CMP block is active; if equal, the output is inactivated.
In this regard, bits 8:11 of I2 correspond with register r3, bits 12:15 with register r4, and bits 16:19 with
register r5. The comparison block 217 is provided to compare register r1 with only the first three bits of the
r4 register field of I2. This comparison is used to detect execution dependencies between I1 and a BXH or
BXLE instruction where bits 12:15 identify an even register but the instruction makes provision for
comparison with an adjacent register with an odd number. In this case equivalence of bits 8:10 of I1 and
bits 12:14 of I2 will signify equivalence of the register r1 with either of the odd or even registers designated
in the r4 field of I2. This, of course, indicates an execution interlock.
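
The register comparisons can be sketched as follows. The sketch is illustrative: the dictionary keys name the comparison performed rather than the reference numerals, I1 and I2 are modelled as 12-bit and 20-bit integers with bit 0 as the most significant bit, and the function names are hypothetical. Each output is active (True) when the compared fields differ, as stated above. The sample I2 uses 0x86, the System/370 op code for BXH.

```python
def register_fields(i1, i2):
    """Pull r1 from the 12-bit I1 and r3, r4, r5 from the 20-bit I2."""
    r1 = i1 & 0xF           # bits 8:11 of I1
    r3 = (i2 >> 8) & 0xF    # bits 8:11 of I2
    r4 = (i2 >> 4) & 0xF    # bits 12:15 of I2
    r5 = i2 & 0xF           # bits 16:19 of I2
    return r1, r3, r4, r5

def cmp_blocks(i1, i2):
    """Outputs of the comparison blocks; True means the fields are unequal."""
    r1, r3, r4, r5 = register_fields(i1, i2)
    return {
        "r1_ne_r3": r1 != r3,
        "r1_ne_r4": r1 != r4,
        "r1_ne_r5": r1 != r5,
        # Three-bit compare: ignores the low bit, so it matches either
        # register of the even/odd pair named by the r4 field.
        "r1_ne_r4_pair": (r1 >> 1) != (r4 >> 1),
    }

# I1 = AR with r1 = 5; I2 = BXH with r3 = 2, r4 = 4 (even/odd pair 4, 5), r5 = 0.
outputs = cmp_blocks(0x1A5, 0x86240)
```

In this example the full four-bit comparison of r1 with r4 reports inequality, but the three-bit comparison does not, exposing the dependency on the odd register of the pair.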

In Fig. 9, the remaining logic up to and including the OR gate 251 is provided for combining the register
field comparisons with op code indications to determine whether I1 and I2 are instructions which can be
compounded. If compoundable, the output of the OR gate 251 is asserted, which will result in activation of
the C bit for the half word identified as I1.

With reference to the compounding rules given above, the remainder of the logic in Fig. 9 will be
explained. In the first rule, the category 1 instruction is compoundable with another category 1 instruction,
with two exceptions. The first exception is when r1 is equal to both r3 and r4. This condition is tested in the
OR gate 220, connected to comparison blocks 214 and 215. The output of the OR gate 220 is fed, together
with the output of the decoder 210a and the decoder in the decoder bank 212 which decodes I = C1, to the
AND gate 221. If the exception condition is not met, the output of the AND gate 221 will be asserted,
indicating that the first exception to the compounding of two category one instructions does not apply. The
second exception is listed above and occurs when the op code of I1 identifies an AR, an SR, an ALR, or an
SLR instruction, the op code of I2 identifies an LPR or an LNR instruction, and r1 = r4. The I1 op codes for
this exception are tested in the OR gate 222, while the AND gate 223 tests the concurrence of the I1 and I2
op code exceptions. Thus, if the output of the AND gate 223 is asserted, the op codes for I1 and I2 indicate
instructions in the respective exception classes. The output of the AND gate 223 is combined in the AND
gate 224 with the output of the comparator block 215. If r1 = r4, the output of this block will be inactive,
which will keep the output of the AND gate 224 from activating. If the comparator block 215 is active,
indicating inequality of the registers, the output of the AND gate 224 will activate, indicating that the
conditions of the exception have not been met. The outputs of the AND gates 221 and 224 are forwarded
through the OR gate to the OR gate 251.

The AND gate 227 tests for compounding according to rule 2 of the compounding rules. Thus the gate
is activated if the op code of I1 is in category 1, the op code of I2 is in category 2, and r1 does not equal ...

The OR gate 233 ... is tested ... instruction, address generation dependency occurs ... occurrence of
the last exception of the ... gate 230 receives inputs from comparison blocks 215 and ...
compounding rule 3 ... r5 ... or execution dependency ... instruction, in which case address ...
is a category 3 instruction, and none of the exceptions to rule 3 occur, ... output of the OR ...
instructions are not compounded. If I1 is a category 1 instruction and I2 is a category ...
with the two exceptions to rule ... being tested, respectively ...

Rules ... are implemented, respectively, by AND gates 241, 242, 245, and 246.
Rules ... 249, 250 and 252. Rules 11-13
have no exceptions. The OR gate receives the outputs of the AND gates 247-250 and 252, and the outputs
... combined with the output of the ... for categories 10-17. The output of the AND gate ...

The OR gate 251 collects the results of testing I1 and I2 according to the category 1 rules. The output
of the OR gate 251 is combined with outputs of groups of CBOX logic which apply appropriate rules
... for cases where I1 is in any one of categories 2-17. The output of all category rule logic
is collected in the OR gate ... whose output at B1 provides the C bit for the half word identified as I1.



Truncation 



Referring now to Figs. 6, 8A and 8B, the truncation of compounding for the last three half words in QW7
... the TRUNCATE signal is inactive, the C bits ...
the last half word in the S1 register with the first or second half word in the L2LO register ...
Activation of the TRUNCATE signal, inverted at the input to the AND gate, ... the
output of that gate and forcing the C bit for the half word at the I1 input of the CBOX to zero. The
next-to-last and last half words of QW7 are truncated in the same manner as the first by AND gates 91 and 90,
respectively.
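
The effect of the TRUNCATE signal on the C bits of the S1 register can be sketched as a simple gating function. This is a behavioural illustration only: the function name is hypothetical, the four C bits are taken to be those produced by CBOXs 80-83 in half-word order, and each of the last three is modelled as passing through an AND element whose other input is the inverted TRUNCATE signal.

```python
def gate_s1_c_bits(c_bits, truncate):
    """Apply truncation to the four C bits generated for the S1 register.

    c_bits   -- four C bits, one per half word of S1, in half-word order
    truncate -- state of the TRUNCATE signal (True = active)
    """
    first, *last_three = c_bits
    pass_through = 0 if truncate else 1          # inverted TRUNCATE input of each AND element
    gated = [c & pass_through for c in last_three]
    return [first, *gated]                       # the first half word's C bit is not gated

normal = gate_s1_c_bits([1, 0, 1, 1], truncate=False)
truncated = gate_s1_c_bits([1, 0, 1, 1], truncate=True)
```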

A Scalable Compound Instruction Set Machine Architecture 

Referring to Fig. 10, there is shown a detailed example of how a computer system can be ...
supplied to and stored into the compound instruction cache 412. The fetch and issue unit 460 fetches the
instructions and their tags from cache 412, as needed, and arranges for their processing by the appropriate
one or ones of a plurality of functional instruction processing units 461, 462, 463 and 464. Fetch and issue
unit 460 examines the tags and op code fields of the fetched instructions. If the tags indicate that two
successive instructions may be processed in parallel, then fetch and issue unit 460 assigns them to the
appropriate ones of the functional units 461-464 as determined by their op codes and they are processed in
parallel by the selected functional units. If the tags indicate that a particular instruction is to be processed in
a singular, nonparallel manner, then fetch and issue unit 460 assigns it to a particular functional unit as
determined by its op code and it is processed or executed by itself.

The first functional unit 461 is a branch instruction processing unit for processing branch type
instructions. The second functional unit 462 is a three-input address generation arithmetic and logic unit
(ALU) which is used to calculate the storage address for instructions which transfer operands to or from
storage. The third functional unit 463 is a general purpose arithmetic and logic unit (ALU) which is used for
performing mathematical and logical type operations. The fourth functional unit 464 in the present example
is a data dependency collapsing ALU of the kind described in the above-referenced co-pending Application
Serial No. 07/504,910. This dependency collapsing ALU 464 is a three-input ALU capable of performing two
arithmetical/logical operations in a single machine cycle.
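
The issue behavior described for the fetch and issue unit 460 can be sketched in a few lines of Python. This is an illustrative model only, not the circuitry of Fig. 10: the op code names, the `UNIT_FOR_OP` table, and the convention that a tag of 1 means "may be processed in parallel with the next instruction" are assumptions made for the example.

```python
# Hypothetical op-code-to-unit table; the unit numbers follow Fig. 10,
# but the op code names and the mapping itself are assumptions.
UNIT_FOR_OP = {
    "BRANCH": 461,   # branch unit
    "LOAD": 462,     # address generation ALU
    "STORE": 462,
    "ADD": 463,      # general purpose ALU
    "COMPARE": 464,  # dependency collapsing ALU
}


def issue(instructions, tags):
    """Group instructions into issue slots of one or two instructions.

    Assumed convention: tag 1 means "may execute in parallel with the
    next instruction"; tag 0 means "issue alone or end a pair".
    Returns a list of slots, each a list of (op, unit) pairs.
    """
    slots = []
    i = 0
    while i < len(instructions):
        group = [instructions[i]]
        if tags[i] == 1 and i + 1 < len(instructions):
            group.append(instructions[i + 1])
        slots.append([(op, UNIT_FOR_OP[op]) for op in group])
        i += len(group)
    return slots


slots = issue(["ADD", "COMPARE", "BRANCH", "STORE"], [1, 0, 1, 0])
```

In this sketch the issue logic never analyzes instruction pairs itself; it only reads the tags produced earlier and uses the op codes to pick a unit, which is the division of labor the text describes.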

The computer system embodiment of Fig. 10 also includes a set of general purpose registers 465 for 
use in executing some of the machine-level instructions. Typically, these general purpose registers 465 are 
used for temporarily storing data operands and address operands or are used as counters or for other data 
processing purposes. In a typical computer system, sixteen (16) such general purpose registers are 
provided. In the present embodiment, general purpose registers 465 are assumed to be of the multiport 
type wherein two or more registers may be accessed at the same time. 

The computer system of Fig. 10 further includes a high-speed data cache storage mechanism 466 for 
storing data operands obtained from a higher-level storage unit (not shown). Data in the cache 466 may also 
be transferred back to the higher-level storage unit. A cache management unit receives instruction
addresses from the control unit 460 and either moves the addressed instruction and its tag to the unit, or 
detects a miss and begins the process of moving a cache line into the cache. 

The particular mode in which the tags accompany compounded instructions for storage in the cache
466 is a matter of design choice. In many of the cross-referenced applications, the tags are inserted into the
compounded instruction stream, with each tag bit appended to the half word for which it was generated. For
purposes of illustration, a technique for providing tag bits for storage and use with a cache line is illustrated
in Fig. 11. As Fig. 11 shows, instructions may occupy six, four, or two bytes. For the example of this
invention, the compounding rules apply only to instructions of two or four bytes' length. Instructions which
are six bytes in length are not compounded. However, tags are generated for every half word in a cache
line. As Fig. 11 illustrates, the tag bits are preferably assembled into a C-vector which is separate from the
compounded cache line. In Fig. 11, a portion of a cache line including quadwords QWI and QWI + 1 is
indicated by 390, while the accompanying tags are shown in the form of a C-vector 372. It will be obvious to
those reasonably skilled in the art that the C-vector can be formed by parallel extraction of C bits registered
in the CVR 64 of Fig. 6. With the compounding bits vectored as illustrated in Fig. 11, there are a number of
ways to implement their storage in cache. Figs. 12A and 12B illustrate two such ways. Figs. 12A and 12B
both assume a quadword-wide bus, which comports with the bus 60 in Fig. 6, plus extra lines between the
instruction compounding unit and the compound instruction cache for tags. Further, in keeping with the
example explained above, the cache line is assumed to be eight quadwords in length, with the instruction
compounding unit generating one compounding bit for every two bytes of text in a cache line. Thus, 64
compounding bits are generated for each compound cache line. These bits must be accommodated in a
cache architecture which associates the compounding bits with their respective half words.
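
As a rough illustration of the C-vector arithmetic, the following Python sketch packs one compounding bit per half word into a single integer and looks up the bit that belongs to a given byte offset in the line. The bit ordering (half word 0 in the least significant position) and the function names are assumptions for the example; the text does not specify either.

```python
HALF_WORD_BYTES = 2
QUADWORD_BYTES = 16
LINE_QUADWORDS = 8
LINE_BYTES = QUADWORD_BYTES * LINE_QUADWORDS   # 128 bytes per cache line
BITS_PER_LINE = LINE_BYTES // HALF_WORD_BYTES  # 64 compounding bits


def build_c_vector(c_bits):
    """Pack per-half-word C bits (half word 0 first) into one integer.

    Assumption: half word 0 occupies the least significant bit.
    """
    if len(c_bits) != BITS_PER_LINE:
        raise ValueError("expected one C bit per half word in the line")
    vector = 0
    for index, bit in enumerate(c_bits):
        vector |= (bit & 1) << index
    return vector


def c_bit_for_offset(vector, byte_offset):
    """Return the C bit of the half word that contains byte_offset."""
    return (vector >> (byte_offset // HALF_WORD_BYTES)) & 1


bits = [0] * BITS_PER_LINE
bits[0] = 1   # half word 0 begins a compound pair
bits[5] = 1   # half word 5 begins a compound pair
vector = build_c_vector(bits)
```

Keeping the vector separate from the instruction text, as here, is what allows the two cache organizations of Figs. 12A and 12B to differ only in where the 64 bits are stored.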

The simplest implementation for caching compounding bits with an associated cache line would see an
increase in the internal word size of the processor between the cache and the instruction fetch and issue
unit, as illustrated in Fig. 12A. This implies that the compounding bits are appended to quadwords, or
inserted into the instruction stream at each half word. In Fig. 12A, a cache line organized into eight storage
locations is illustrated. Without compounding, each location is sixteen bytes (one quadword) wide. With
eight locations, a 128-byte cache line is stored. With one compounding tag per half word, and two-way
compounding, a minimum of one extra bit of storage for every half word of instruction text is required. Thus,
eight compounding bit locations are required for every sixteen bytes. The implication is that the cache word
size must be expanded from 128 to 136 bits. Fig. 12A illustrates a cache structure for two-way compounding
and a quadword-wide cache bus. The cache bus and internal word size are expanded to 136 bits. The
drawback to this scheme is that a new memory design is required, implying, for example, error correction
for larger words.
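
The 128-to-136-bit expansion can be checked with a short Python sketch that appends eight tag bits to a quadword of instruction text and separates them again. Placing the tag bits above the instruction text is an assumption made for illustration; the text only states the resulting word size.

```python
QUADWORD_BITS = 128
TAG_BITS = 8                            # one C bit per half word of a quadword
WORD_BITS = QUADWORD_BITS + TAG_BITS    # widened cache word: 136 bits


def pack_word(text, tags):
    """Combine 128 bits of instruction text with 8 tag bits.

    Assumption: the tag bits sit in the high-order positions.
    """
    if not 0 <= text < (1 << QUADWORD_BITS):
        raise ValueError("instruction text must fit in 128 bits")
    if not 0 <= tags < (1 << TAG_BITS):
        raise ValueError("tags must fit in 8 bits")
    return (tags << QUADWORD_BITS) | text


def unpack_word(word):
    """Split a widened cache word back into (text, tags)."""
    text = word & ((1 << QUADWORD_BITS) - 1)
    tags = word >> QUADWORD_BITS
    return text, tags


word = pack_word(0x0123456789ABCDEF0123456789ABCDEF, 0b10100000)
```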

A second approach is illustrated in Fig. 12B and utilizes a tag cache that is separate from, but operated
in parallel with, the instruction cache. This structure implies that tags are separate from the instruction text.
However, as with Fig. 12A, the requirement that the tags accompany their respective instructions
necessitates expansion of the bus between the cache and the instruction fetch and issue unit. In this case, the
internal cache word size is unchanged; however, the size of the bus between the cache and the instruction
fetch and issue unit must increase to accommodate parallel operation of the tag cache. The design of Fig.
12B may be hardwired. Alternatively, a separate tag cache management unit would be provided.

Fig. 13 shows an example of a compounded instruction sequence 500 which may be processed by the
computer system of Fig. 10. The Fig. 13 example is composed of the following instructions in the following
sequence: Load, Add, Compare, Branch and Store. The tags for these instructions are 1, 1, 0, 1 and 0,

TheS rV^Z*r^^ Slch accompanies the mstructions 500. Because of 
respectively. These tags are arrayed m a instruction is processed in a singular manner by 

itself, i he Ada a a * instructions are also treated as a compound instruct. . and 

' T^^T^ "™^- When these instructions are prcv.ded to the ins,,ction 

fetch/issue un.t. they are ^^^^^^^e of tne Fig. 13 instructions. The FVM column in 
The table of Fig. 14 --"arizes ^^^ n h ^° 0 \ 1 . As ^ scussed above, this field is typically 
Fig. 14 indicates the contents of the r rst s wnich contains the first operand. An 

used to identity a particular one or the general purpos g contains a condition code 

n^'™^" F ""I : n icaTes the" "of the fieid in two-byte instructions which identifies 
mask. The R/X column g four . byt e instructions, identifies the register containing the 

me second operand reg,st | r ^" ^ p . ' Vindicates the contents of the register field in a four-byte 
address index value. Tne E co urn m Rg. 14 System/370 instructions, a zero in the B 

an address disolacement value of zero. instruction of Fig 13 . the fetch/issue control unit 460 

„ , COnS ' d s e r m ^Z^lZ^^^^o. is to be processed in a singular 

determines from the tags . ° r mis Load jnstruction is to fetch an operand from storage, in 

manner by itself. Tne aeon to i be peri 1 R2 aj register . The storage 

this case the data cache 466. ana to place ™ etermined by 9 adding together the index value in 
address from which th.s operand .s \*JZ^****£^°^ D T he fetch/issue control unit 460 

miimm§m 

me c A rs'~ srs ■rrsrsr set «. ~ * r 

Z?.J*r *nri mares the result of the addition back into the R3 general purpose register. At the 
, R3 + R2 - R4 

The condition code for the result of this operation is sent to a condition code register located in branch unit
461. The data dependency is collapsed because ALU 464, in effect, adds R3 to R2 and then compares this
sum with R4 to determine the condition code. ALU 464 thus does not have to wait on the results from the
ALU 463, which is performing the Add instruction. In this particular case, the numerical results calculated
by the ALU 464 and appearing at the output of ALU 464 are not supplied back to the general purpose
registers 465. In this case, ALU 464 merely sets the condition code.
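
The effect of collapsing the Add/Compare dependency can be illustrated with a small Python sketch. It is a behavioral model under stated assumptions, not a description of ALU 464's circuitry: registers are treated as 32-bit two's-complement values, overflow handling is ignored, the condition code follows the System/370 Compare convention (0 equal, 1 first operand low, 2 first operand high), and the register roles (Add places R3 + R2 in R3, Compare tests R3 against R4) are one reading of the example.

```python
MASK32 = 0xFFFFFFFF


def to_signed(value):
    """Interpret a 32-bit pattern as a two's-complement integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def compare_cc(first, second):
    """System/370-style compare: 0 equal, 1 first low, 2 first high."""
    a, b = to_signed(first), to_signed(second)
    if a == b:
        return 0
    return 1 if a < b else 2


def serial_add_then_compare(r3, r2, r4):
    """Two dependent steps: Add updates R3, then Compare reads the new R3."""
    r3 = (r3 + r2) & MASK32
    return r3, compare_cc(r3, r4)


def collapsed_compare(r3, r2, r4):
    """Three-input operation: condition code of (R3 + R2) against R4.

    It uses the original R3 and R2, so it need not wait for the Add result.
    """
    return compare_cc((r3 + r2) & MASK32, r4)
```

Both paths yield the same condition code; the collapsed form simply reads the Add's source operands instead of its result, which is why the two instructions can proceed in the same cycle.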
Considering now the processing of the Branch instruction and the Store instruction shown in Fig. 13,
these instructions and their tags are fetched from the compound instruction cache 412 by the fetch/issue
control unit 460. Control unit 460 determines from the tags for these instructions that they may be
processed in parallel with one another. It further determines from the op codes of the two instructions that
the Branch instruction should be processed by the branch unit 461 and the Store instruction should be
processed by the address generation ALU 462. In accordance with this determination, the mask field M and
the displacement field D of the Branch instruction are supplied to the branch unit 461. Likewise, the address
index value in register X and the address base value in register B for this Branch instruction are obtained
from the general purpose registers 465 and supplied to the branch unit 461. In the present example, the X
value is zero and the base value is obtained from the R7 general purpose register. The displacement value
D has a hexadecimal value of twenty, while the mask field M has a mask position value of eight.

The branch unit 461 commences to calculate the potential branch address (0 + R7 + 20) and at the
same time compares the condition code obtained from the previous Compare instruction with the condition
code mask M. If the condition code value is the same as the mask code value, the necessary branch
condition is met and the branch address calculated by the branch unit 461 is thereupon loaded into an
instruction counter in control unit 460. This instruction counter controls the fetching of the instructions from
the compound instruction cache 412. If, on the other hand, the condition is not met (that is, the condition
code set by the previous instruction does not have a value of eight), then no branch is taken and no branch
address is supplied to the instruction counter in control unit 460.

At the same time that the branch unit 461 is busy carrying out its processing actions for the Branch
instruction, the address generation ALU 462 is busy doing the address calculation (0 + R7 + 0) for the
Store instruction. The address calculated by ALU 462 is supplied to the data cache 466. If no branch is
taken by the branch unit 461, then the Store instruction operates to store the operand in the R3 general
purpose register into the data cache 466 at the address calculated by ALU 462. If, on the other hand, the
branch condition is met and the branch is taken, then the contents of the R3 general purpose register are not
stored into the data cache 466.
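
The parallel handling of the Branch and Store pair can be modeled as follows. This Python sketch is illustrative only. It assumes the System/370 Branch on Condition rule, in which mask bits 8, 4, 2 and 1 select condition codes 0, 1, 2 and 3 (the text above describes the test more loosely as the condition code matching the mask); it also assumes 32-bit address wraparound and models the data cache as a plain dictionary.

```python
def address(index, base, displacement):
    """Three-input address generation: index + base + displacement."""
    return (index + base + displacement) & 0xFFFFFFFF  # 32-bit wrap assumed


def branch_taken(cc, mask):
    """Assumed System/370 rule: mask bits 8, 4, 2, 1 select CC 0, 1, 2, 3."""
    return bool(mask & (8 >> cc))


def branch_and_store(regs, cc, data_cache, next_sequential):
    """Process the compounded Branch/Store pair of the example.

    Returns the address of the next instruction to fetch.
    """
    target = address(0, regs[7], 0x20)    # branch unit 461: 0 + R7 + 20 (hex)
    store_addr = address(0, regs[7], 0)   # address ALU 462: 0 + R7 + 0
    if branch_taken(cc, mask=8):
        return target                     # branch taken: the store is suppressed
    data_cache[store_addr] = regs[3]      # no branch: store R3
    return next_sequential


regs = {3: 99, 7: 0x1000}
cache_taken = {}
next_taken = branch_and_store(regs, 0, cache_taken, 0x200)
cache_not_taken = {}
next_not_taken = branch_and_store(regs, 2, cache_not_taken, 0x200)
```

Both address calculations are performed unconditionally, mirroring the two units working at the same time; only the final commit of the store depends on the branch outcome.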

The foregoing instruction sequence of Fig. 13 is intended as an example only. The computer system 
embodiment of Fig. 10 is equally capable of processing various other instruction sequences. The example
of Fig. 13, however, clearly shows the utility of the compound instruction information in determining which 
pairs of instructions may be processed in parallel with one another. 


Considerations of Industrial Application 



The discussion above provides a hardware implementation for compounding instructions for parallel
execution. It is asserted that this solution does not compromise the cycle time of the machine in which it is
embodied.

As the example of Figs. 12-14 shows, it can support and even simplify the control of a large number of 
functional units. As Figs. 6-11 show, the instruction compounding unit, the cache configuration, and the
instruction processing architecture which result are all feasible for implementation.

The compound instruction cache architecture gives rise to a number of distinct advantages in the
industrial application of the invention. First, it eliminates the need for a software compounding facility, which
permits the invention to be applied to existing instructions without modifying their object code forms and
which can accommodate future codes, thereby obviating modification to compilers or assemblers. Next, the
overhead required for storage of the compounding information is limited to the compound instruction cache.
No overhead is imposed on any storage means standing above the cache in the memory hierarchy: not in
the semiconductor memory (main memory), in the direct access device storage, or anywhere else. Further,
the only time a performance penalty will occur for non-sequential operations is when the target instruction
required for the operation is not in the cache. In the case of branches, the likelihood of that occurring is
directly related to the miss ratio of the cache. It is entirely conceivable for a compound instruction cache of
sufficient size to contain entire program loops of compound instructions, making the branch penalties
negligible. Another advantage of this architecture is the ability of self-modifying code to be handled simply
by trapping writes to the compound instruction stream, invalidating the cache line written to, requesting the
updated line from the upper levels of the memory hierarchy, and recompounding the line. Last, even though
the proposed architecture changes neither the amount nor the duration of the analysis that must be
performed to attain a particular level of compounding (and, thus, parallelism), the analysis is performed only
when a cache miss occurs and is thus infrequent by definition: no designer would purposefully build an
instruction cache with a high miss ratio into a high-performance computer. The compounding analysis will
increase cache miss service time by some amount proportional to the degree of analysis performed.
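
Two of these points, compounding only on a cache miss and handling self-modifying code by invalidation, can be sketched with a toy Python model. Everything here is hypothetical scaffolding: the class and function names, the write-through behavior, the dictionary standing in for main memory, and the one-rule compounder are assumptions made for the example. The model also refetches an invalidated line lazily on the next fetch rather than immediately.

```python
class CompoundInstructionCache:
    """Toy model: lines are compounded once, when they are brought in."""

    def __init__(self, main_memory, compound):
        self.main_memory = main_memory  # dict: line address -> list of op codes
        self.compound = compound        # function: op codes -> list of tag bits
        self.lines = {}                 # line address -> (op codes, tags)
        self.compound_count = 0         # how often the analysis has run

    def fetch(self, line_addr):
        """Return (instructions, tags); compound only on a miss."""
        if line_addr not in self.lines:
            text = list(self.main_memory[line_addr])
            self.compound_count += 1
            self.lines[line_addr] = (text, self.compound(text))
        return self.lines[line_addr]

    def write(self, line_addr, index, instruction):
        """Trap a write to instruction text and invalidate the cached line."""
        self.main_memory[line_addr][index] = instruction
        self.lines.pop(line_addr, None)


def tm_bc_rule(instructions):
    """Hypothetical one-rule compounder: tag a TM that is followed by a BC."""
    tags = [1 if a == "TM" and b == "BC" else 0
            for a, b in zip(instructions, instructions[1:])]
    return tags + [0] * (len(instructions) - len(tags))


memory = {0: ["TM", "BC", "ADD"]}
cache = CompoundInstructionCache(memory, tm_bc_rule)
first = cache.fetch(0)       # miss: the line is compounded
again = cache.fetch(0)       # hit: no further analysis
count_after_hits = cache.compound_count
cache.write(0, 0, "ADD")     # self-modifying store invalidates the line
refetched = cache.fetch(0)   # miss: the line is recompounded
```

Main memory in this model never holds tags, which reflects the point that the storage overhead is confined to the compound instruction cache.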

The first design consideration in developing an industrial application of this invention can be appreciated
with reference to Fig. 6. The staging unit 61 effectively permits compounding over an entire quadword,
which is precisely the unit of transfer between the main memory and the compound instruction cache. In
matching the size of the unit of transfer into the cache, the compounding process can consider all available
pairs of instructions as they are presented to the cache for storage therein. This reduces the time penalty
for two-way compounding. In the general case, the size of the staging unit is a function of the number of
instructions that constitute a single compound instruction and the scope of the analysis for compounding. In
some cases it may turn out that increasing the size of the staging unit beyond a certain value may have
diminishing returns.

The complexity of the instruction compounding unit will vary with the goals which compounding is
intended to achieve. In this regard, the instruction compounding unit of Fig. 6 implements compounding
rules for seventeen categories of instructions in a scheme which compounds at the maximum only two
instructions. More complex compounding over, for example, three or more instructions can be accomplished
by a compounding unit whose compounding section extrapolates the basic design of the CBOX illustrated in
Fig. 9. Such a design may result in a more complex tag which would include control information,
compounding information, steering bits, and other information of the type typically associated with horizontal
microcode. The creation of compounding information and the semantics imputed to the tag are limited only
by size constraints of the design and the time penalty ascribed to cache miss servicing. Relatedly, the tag
can be as minimal or maximal as time and space allow. For example, consider the very frequent
System/370 instruction pair Test Under Mask (TM) followed by Branch on Condition (BC). Given the high
frequency of the instruction pair, compounding it alone for parallel execution can improve processor
performance. Should a designer choose to compound only this pair, then the rules base for the compounding
unit contains only one rule, and the CBOX and compounding unit become trivial. At the other extreme,
the rules base may contain rules for a subset, but still a substantial part, of a complete instruction-set
architecture. It may additionally contain further information pertaining to the physical properties of the
functional units, facilitating the embedding of control information in the tags. The rules base, though
implementable in hardwired, random logic, may be implemented in some form of fast-access programmable
storage, thereby allowing for flexibility as more functional units are added or subtracted, more or fewer
types of compoundings are desired, or even as the computing environment changes. Relatedly, certain
compoundings may be more advantageous in a commercial environment than in an engineering-scientific
environment, or vice versa. This implies that the rules base can be programmable, with rules decisions
being made at machine configuration time. Therefore, the inventors contemplate that, instead of being
hardwired, the CBOX functions of the instruction compounding unit could be implemented in a fast-access,
multi-ported memory which is programmable with a desired set of rules at the time a machine is
manufactured.
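
The idea of a programmable rules base can be sketched as a lookup table loaded at configuration time. The Python below is a minimal illustration under stated assumptions: the `RuleBase` class, the category names, and the second rule set are hypothetical, and interlock (data dependency) checks, which the CBOX also applies, are deliberately left out.

```python
class RuleBase:
    """Programmable stand-in for hardwired CBOX rule logic.

    Holds a set of (category of I1, category of I2) pairs that may be
    compounded. Interlock checks are omitted from this sketch.
    """

    def __init__(self, category_of, compoundable_pairs):
        self.category_of = dict(category_of)   # op code -> category name
        self.pairs = set(compoundable_pairs)

    def c_bit(self, i1, i2):
        """Return 1 if I1 may be compounded with the following I2."""
        key = (self.category_of.get(i1), self.category_of.get(i2))
        return 1 if key in self.pairs else 0


# Minimal configuration: only the TM/BC pair is compounded.
minimal = RuleBase(
    {"TM": "test", "BC": "branch"},
    {("test", "branch")},
)

# A richer, purely illustrative configuration loaded into the same mechanism.
richer = RuleBase(
    {"TM": "test", "BC": "branch", "AR": "rr_arith", "CR": "rr_compare"},
    {("test", "branch"), ("rr_arith", "rr_compare"), ("rr_compare", "branch")},
)
```

Changing the machine's compounding behavior then amounts to loading a different table, which is the flexibility the paragraph attributes to fast-access programmable storage.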

Proposals have been made for decreasing the cache miss ratio by prefetching cache lines without
waiting for a cache miss. If the cache management unit were designed to prefetch the next-sequential line 
of instructions, it would be possible to hide much of the time required by the instruction compounding unit 
for compounding. The fraction of all line compoundings that are hidden will be determined in this case by 
the program instruction-fetching behavior, as well as the organization for the compound instruction cache. 

Certain specific design decisions have been incorporated into the discussion above for the purpose of 
presenting examples. Thus, this invention may be practiced by incorporation of C bits directly into the 
instruction stream at each half word boundary. Further, compound instructions could simply issue directly
from the cache, rather than employing an instruction fetch/issue control unit with a buffer or stack. Also, 
when a cache miss and subsequent line fetch occur, it may be beneficial from a performance standpoint to 
pass the instruction addressed for execution directly to the functional units for execution at the scalar rate, 
rather than stall the functional units while the line is analyzed for compounding. 

Therefore, while we have described what are considered to be preferred embodiments of the invention,
it will be obvious to those skilled in the art, in view of all of the considerations discussed above, that various
changes and modifications may be made to the invention without departing from its spirit. Therefore, the
invention and this description are intended to cover all changes and modifications as fall within the spirit and
scope of the appended claims.



Claims 



1. In a digital computer system capable of processing two or more instructions in parallel, the combination
comprising:

a larger-capacity, lower-speed storage mechanism for storing instructions to be processed;

instructions stored therein to different ones of the functional instruction processing units when their
compounding information indicates that they may be processed in parallel.

8. The combination of claim 7 wherein the further storage mechanism is a small-capacity, high-speed
cache storage mechanism.

9. In a digital computer system including means for executing two or more instructions and a main
memory and cache for storing instructions, a method for processing instructions for parallel execution,
the method comprising the steps of:

storing a plurality of instructions in the main memory;

obtaining a sequence of instructions from the main memory for execution;

in response to the sequence of instructions, generating compounding information signifying parallel
execution of at least two instructions in the sequence of instructions; and

storing the sequence of instructions and the compounding information in the cache.

10. The method of Claim 9, wherein the plurality of instructions are in object code format and the step of
generating includes generating the compounding information without altering the object code format of
the sequence of instructions.

11. The method of Claim 9 or 10, wherein the step of storing in the cache includes storing the
compounding information only in the cache.

12. The method of one of claims 9 to 11, wherein the digital computer system includes means for providing
a cache miss signal when an instruction to be executed is not in the cache, the method further
including the step of repeatedly, until a cache miss occurs, (1) obtaining from the cache at least two
instructions of the sequence of instructions together with compounding information for the two
instructions and (2) in response to the compounding information, executing the at least two instructions
in parallel.

13. The method of Claim 12, further including, when a cache miss occurs, executing the steps of obtaining
a sequence of instructions, generating compounding information, and storing the sequence of
instructions and the compounding information in the cache.

14. The method of one of claims 9 to 13, wherein the digital computer system includes means for providing
a cache miss signal when an instruction to be executed is not in the cache, the step of obtaining
including fetching the instruction sequence in response to the cache miss signal.

15. The method of Claim 14, further including the steps of:

prefetching a sequence of instructions from the main memory;

in response to the prefetched sequence of instructions, generating compounding information, signifying
parallel execution of at least two adjacent instructions in the prefetched sequence of instructions; and

storing the second sequence of instructions and the compounding information in the cache.


16. The method of one of claims 9 to 15, further including the step of:

concurrently with the generating step, executing an instruction of the sequence of instructions.

17. The method of one of claims 9 to 16, further including the steps of:

executing an instruction of the sequence of instructions;

a smaller-capacity, higher-speed storage mechanism for storing instructions with associated tag
information;

and an instruction compounding mechanism coupled between the lower-speed storage mechanism and
the higher-speed storage mechanism for receiving instructions from the lower-speed storage
mechanism, for analyzing these instructions and producing tag information which indicates which instructions
may be processed in parallel with one another and for supplying these instructions and associated tag
information to the higher-speed storage mechanism for storage therein.

2. The combination of claim 1 wherein the higher-speed storage mechanism is a cache storage
mechanism.

3. The combination of claim 1 or 2 wherein the tag information is comprised of a plurality of tags, a
different one of which is associated with each instruction analyzed by the instruction compounding
mechanism.

4. The combination of one of claims 1 to 3 wherein the computer system includes a plurality of functional
instruction processing units which operate in parallel with one another and the tag information is used in
issuing two or more instructions from the higher-speed storage mechanism to different ones of the
functional units.

5. The combination of claim 3 or 4 wherein the instruction compounding mechanism includes:

a plural-instruction instruction register for receiving a plurality of successive instructions from the
lower-speed storage mechanism;

a plurality of rule-based instruction analyzer mechanisms, each of which analyzes a different pair of
side-by-side instructions in the instruction register and produces a compoundability signal which
indicates whether or not the two instructions in its pair may be processed in parallel;

and a tag generating mechanism responsive to the compoundability signals for generating the
individual tags for the different instructions in the instruction register.

6. The combination of claim 5 wherein:

the computer system has a particular instruction processing configuration;

and each instruction analyzer mechanism includes logic circuitry for implementing rules which define
which types of instructions are compatible for parallel execution in the particular instruction processing
configuration used for the computer system, such logic circuitry producing the compoundability signal
for that analyzer mechanism.

7. In a digital computer system capable of processing two or more instructions in parallel, the combination
comprising:

a first storage mechanism for storing instructions to be processed;

an instruction compounding mechanism for receiving instructions from the first storage mechanism and
associating with these instructions compounding information which indicates which of these instructions
may be processed in parallel with one another;

a further storage mechanism coupled to the instruction compounding mechanism for receiving and
storing the instructions and their associated compounding information;

a plurality of functional instruction processing units which operate in parallel with one another;

and an instruction issue mechanism coupled to the further storage mechanism for supplying adjacent

as a result of executing the instruction, altering the sequence of instructions;

obtaining the altered sequence of instructions; and

repeating the generating and storing steps.
18. An apparatus for processing a sequence of compiled instructions for parallel execution, comprising:

buffer means for receiving a sequence of compiled instructions;

categorization means responsive to the sequence of compiled instructions for:

determining whether two or more instructions belong to predetermined categories of instructions; and

providing first signals conditioned to indicate to which predetermined categories the two or more
instructions belong;

interlock means responsive to the sequence of compiled instructions for providing second signals
conditioned to indicate whether interlocks exist between the two or more instructions;

compounding signal means connected to the categorization and interlock means and responsive to the
first and second signals for providing a compounding signal conditioned to indicate whether the at least
two instructions belong to instruction categories compatible for execution in parallel and are
interlock-free; and

storage means connected to the compounding signal means for receiving a plurality of compounding
signals for the sequence of instructions during operation of the compounding signal means.

19. The apparatus of Claim 18, wherein the compounding signal means includes rule means responsive to
the first and second signals for:

testing the conditions of the first and second signals according to a set of rules, the set of rules
establishing which categories of instructions can be executed in parallel and establishing exceptions to
those rules resulting from interlocks.

20. The apparatus of Claim 18 or 19, wherein the sequence of instructions includes a portion of a cache
line.

21. The apparatus of one of claims 18 to 20, wherein the categorization means includes means for 
decoding instructions. 

22. The apparatus of one of claims 18 to 21, wherein the compounding signal means includes a 
programmable device for being selectively programmed to test the first and second signals according 
to the set of rules. 

23. The apparatus of one of claims 18 to 22, further including a high-speed instruction memory means for
storing the sequence of compiled instructions and compounding signals for execution. 

24. In a computer system including a plurality of execution units which execute instructions singly and in
parallel, a method of processing instructions for parallel execution, including the steps of:

generating a sequence of instructions for execution;

prior to execution of the sequence of instructions, generating information signals indicating that at least
two instructions of the sequence can be executed in parallel;

storing the sequence of instructions and the information signals in a storage device to provide fast
access for execution; and

executing instructions of the sequence of instructions.
25. The method of Claim 24, further including:

during and after the executing step, retaining the sequence of instructions and the information signals in
the storage device.

26. In a computer system with means for executing instructions singly and in parallel and a memory means
for receiving a sequence of instructions for imminent execution, a combination for processing
sequences of instructions for parallel execution according to a set of rules which establish instruction
conditions for parallel execution of instructions, the combination comprising:

a means for receiving a sequence of instructions, the sequence of instructions forming a part of a
computer program;

rule base means connected to the means for receiving for:

comparing groups of instructions in the sequence of instructions to determine whether the instructions
of the groups of instructions satisfy rules of the set of rules; and

generating compounding signals conditioned to indicate instructions of a group of instructions which
can be executed in parallel; and

storage means connected to the rule base means and to the memory means for accumulating a
plurality of compounding signals for storage in the memory means with the sequence of instructions.